Image processing

ABSTRACT

An image recognition process (15) applied to a photograph is preceded by a pre-processing step which identifies the parts of the image that are in focus. The recognition process is then limited to those parts. The pre-processing typically operates by creating (10-12) measures of high spatial frequency activity and applying a threshold to them (13, 14).

[0001] The present invention is concerned with image processing, and more particularly with image recognition in a broad sense. By recognition, here, is meant that the image is processed to produce some result which makes a statement about the image. There are a number of different contexts in which this can be useful.

[0002] For example, if the image is of a single object, it may be desired to identify the object as being a specific one of a number of similar objects: the recognition of a human face as being that of a particular person whose picture is stored in a reference database would fall into this category. Alternatively it may be desired to identify an image as containing one or more pictures of objects and to classify it according to the nature of those objects. Thus the automation of the process of indexing or retrieval of images from a database could be facilitated by such recognition, particularly where a large database (or a large number of databases, as in the case of internet searches) is involved. Recognition may be applied not only to still pictures but also to moving pictures; indeed, the increasing availability of audio-visual material has identified a need to monitor material transmitted on television channels, or via video on demand systems, perhaps to verify that a movie film transmitted corresponds to that actually requested.

[0003] Currently there exist recognisers for various midrange features of images, for example the presence of vertical structures, skin regions or faces, and whether the photograph is taken in or out of doors. One could envisage large networks of such recognisers working in combination to make higher level statements about an image. For example, vertical structures with skin tones may be seen as good evidence for people, especially if a face can be found in the vicinity.

[0004] Given such a system we could create a list of objects and features for every image. However, even with this information we would have difficulty describing what the subject of the image was. Consider a newspaper picture editor looking for pictures of birds. He makes a query to a large image database for birds. However, included in the results is a photograph of a couple sitting in a Parisian cafe; in the distance we can just make out a small bird in the branch of a tree. This picture clearly does not satisfy the query as the editor intended it. Most human descriptions of this picture would exclude the bird, because it seems unimportant. If we could judge the relative importance of each feature we could describe to what extent it was the subject. So in the case of the Parisian cafe the bird would be judged to be quite unimportant.

[0005] According to one aspect of the invention there is provided a method of image processing comprising

[0006] (a) identifying one or more significant regions having a higher high spatial frequency content than does the remainder of the image; and

[0007] (b) performing a subsequent recognition process only upon said identified regions.

[0008] Other, preferred, aspects of the invention are set out in the subclaims.

[0009] Some embodiments of the invention will now be described, by way of example, with reference to the accompanying drawings.

[0010] FIG. 1 is a block diagram of an image recognition apparatus.

[0011] It is assumed that the image to which a recognition process is to be applied is a photographic image, by which is meant the fixture of an image formed by focussing the light reflected from a real object, such as by conventional photography, cinematography or a recording (or a single stored frame) from a television camera. However the utility of the invention is not limited to photographic images.

[0012] Thus the apparatus comprises an acquisition device 1 which may be a scanner for scanning photographs or slides, or a device for capturing single frames from a video signal. Such devices are well-known, as indeed is software for driving the devices and storing the resulting image in digital form. The apparatus also has an image store 2 for receiving the digital image, a processing unit 3 and a program store 4. These items might conveniently be implemented in the form of a conventional desktop computer.

[0013] The program store 4 contains, as well as the driver software already mentioned, a program for implementing the process now to be described.

[0014] The rationale of the present approach is that a recognition process may be improved by confining it to significant parts of the image. It is moreover based on the premise that the relative focus of different parts of an image is a good guide to their significance. This is likely to be the case in practice since a photographer (or a person setting up an automatically operated camera) will aim to have the subject(s) of the photograph in focus.

[0015] Clearly this places some constraint on the spatial resolution needed in the digitised image, since with a very coarse digital image the focus information would be lost in the digitisation process. The test image used to produce the results shown in FIG. 3 had a resolution of 450×360 picture elements.

[0016] The approach adopted here is to form, over the area of the image, a measure representative of the level of high spatial frequency detail, and then to apply a decision step in which the region or regions of the image having a high value of this measure are considered to be in focus and those having a low value are considered to be out of focus, since the presence of sharply defined edges is considered a good indication of focus. The actual required recognition process is then performed only in respect of the in-focus areas. Alternatively, the estimation of a region's focus may be used to bias the importance of features recognised there.

[0017] One method of generating the measure is as follows, where the numbers refer to steps in the flowchart of FIG. 2.

[0018] Assume that the photograph is digitised as an M×N image and stored as pixel luminance values p(m,n) (m=0 . . . M−1, n=0 . . . N−1) where p(0,0) is top left, m is horizontal position and n vertical.

[0019] If the image is stored as R, G, B values, luminance values may be calculated from these in known manner. Colour information is not used in forming the measure, though it may be used later.
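The text leaves the luminance conversion to "known manner". One common choice, shown here purely as an assumption and not as the method the description has in mind, is the ITU-R BT.601 weighted sum of the R, G and B values. The function names `luminance` and `to_luminance` are hypothetical, and the image is held as a list of rows so that `p[n][m]` corresponds to p(m,n).

```python
def luminance(r, g, b):
    """Approximate luminance of one pixel using BT.601 weights (an assumed choice)."""
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def to_luminance(rgb_rows):
    """Convert rows of (R, G, B) tuples into rows of luminance values p[n][m]."""
    return [[luminance(r, g, b) for (r, g, b) in row] for row in rgb_rows]


if __name__ == "__main__":
    rgb = [[(255, 255, 255), (0, 0, 0)],
           [(255, 0, 0), (0, 0, 255)]]
    print(to_luminance(rgb))
```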

[0020] Step 10: Edge Detection

[0021] The image is convolved with two kernels representing a filtering operation, to form an edge map $E = e(m,n)$ ($m = 0 \ldots M-2$, $n = 0 \ldots N-2$)

[0022] where

$e(m,n) = k_{00}\,p(m,n) + k_{01}\,p(m+1,n) + k_{10}\,p(m,n+1) + k_{11}\,p(m+1,n+1)$

[0023] and the kernel K is $K = \begin{bmatrix}k_{00} & k_{01} \\ k_{10} & k_{11}\end{bmatrix}$

[0024] In fact two maps $E_1 = e_1(m,n)$ and $E_2 = e_2(m,n)$ are generated using the well-known Roberts Cross convolution kernels: $K = \begin{bmatrix}1 & 0 \\ 0 & -1\end{bmatrix}$ or $K = \begin{bmatrix}0 & 1 \\ -1 & 0\end{bmatrix}$

[0025] Step 11: The elements e of the edge maps each have a value in the range of −255 to +255. As the sign of luminance transitions is not significant the modulus is taken; also the two maps are combined by taking the mean, i.e. $e(m,n) = \frac{1}{2}\left( |e_1(m,n)| + |e_2(m,n)| \right)$, $m = 0 \ldots M-2$, $n = 0 \ldots N-2$

[0026] In another embodiment of the invention the two maps are combined using the following equation:

$e(m,n) = \sqrt{e_1(m,n)^2 + e_2(m,n)^2}$, $m = 0 \ldots M-2$, $n = 0 \ldots N-2$

[0027] The values e(m,n) are stored in a further area of the image store 2.
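Steps 10 and 11 can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the image is held as a list of rows so that `p[n][m]` corresponds to p(m,n), and the function name `edge_map` and its `use_magnitude` flag are hypothetical. The flag selects between the mean-of-moduli combination and the square-root combination of the alternative embodiment.

```python
import math


def edge_map(p, use_magnitude=False):
    """Combined Roberts Cross edge map of luminance rows p, where p[n][m] = p(m,n).

    The result has one fewer row and column than p (m = 0..M-2, n = 0..N-2).
    """
    n_rows, n_cols = len(p), len(p[0])
    e = []
    for n in range(n_rows - 1):
        row = []
        for m in range(n_cols - 1):
            # Kernel [[1, 0], [0, -1]]: p(m,n) - p(m+1,n+1)
            e1 = p[n][m] - p[n + 1][m + 1]
            # Kernel [[0, 1], [-1, 0]]: p(m+1,n) - p(m,n+1)
            e2 = p[n][m + 1] - p[n + 1][m]
            if use_magnitude:
                row.append(math.hypot(e1, e2))        # sqrt(e1^2 + e2^2)
            else:
                row.append((abs(e1) + abs(e2)) / 2)   # mean of the moduli
        e.append(row)
    return e


if __name__ == "__main__":
    # A vertical luminance step between the second and third columns.
    image = [[0, 0, 255, 255] for _ in range(3)]
    for row in edge_map(image):
        print(row)
```

On the step image above, only the column straddling the transition receives a nonzero value, which illustrates why a further spreading stage is needed before regions can be identified.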

[0028] The two kernels used here have, respectively, maximum sensitivity in the two diagonal directions and thus the combined result is sensitive to edges of any orientation. However the resultant map has high values only in the immediate vicinity of an edge. Thus the image shown in FIG. 3a produces an edge map which, when displayed as an image, appears as in FIG. 3b. Note that, for clarity, FIG. 3b is shown as a negative of this. The next stage is to spread or merge these so that neighbouring edges can be recognised as part of a continuous in-focus region.

[0029] Step 12: One method of achieving this is to convolve the edge map with a circular kernel of the form (for a 7×7 kernel) $C = \begin{bmatrix}c_{00} & \cdots & c_{06} \\ \vdots & \ddots & \vdots \\ c_{60} & \cdots & c_{66}\end{bmatrix} = \begin{bmatrix}0 & 0 & 1 & 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 1 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 & 1 & 0 & 0\end{bmatrix}$

[0030] (where we see that the nonzero values of $c_{ij}$ cover a circular area)

[0031] to produce a focus map: $F = f(m,n) = \frac{\sum_{i=0}^{6}\sum_{j=0}^{6} c_{ij}\, e(m+j-3,\, n+i-3)}{\sum_{i=0}^{6}\sum_{j=0}^{6} c_{ij}}$ for $m = 0 \ldots M-2$, $n = 0 \ldots N-2$

[0032] where e is deemed zero for positions outside the map E. It will be observed that this has the effect of spatial low-pass filtering of the map E.

[0033] See below for discussion of the kernel size.

[0034] The appearance of such a map, if displayed as an image, would be as in FIG. 3c.
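The spreading operation of step 12 can be sketched as a normalised convolution. The sketch below assumes the 7×7 kernel in its symmetric circular form and treats the edge map as zero outside its bounds, as the description states. The names `C` and `focus_map` are hypothetical, and the edge map is held as a list of rows so that `e[n][m]` corresponds to e(m,n).

```python
# 7x7 'flat' circular kernel, written out in the symmetric form (an assumption).
C = [
    [0, 0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 0, 0],
]


def focus_map(e, c=C):
    """Focus map f from edge map e, where e[n][m] = e(m,n).

    Each output value is the kernel-weighted sum of the neighbourhood divided
    by the sum of the kernel weights; e is taken as zero outside the map.
    """
    rows, cols = len(e), len(e[0])
    half = len(c) // 2
    weight_sum = sum(sum(c_row) for c_row in c)
    f = [[0.0] * cols for _ in range(rows)]
    for n in range(rows):
        for m in range(cols):
            acc = 0.0
            for i, c_row in enumerate(c):
                for j, c_ij in enumerate(c_row):
                    y, x = n + i - half, m + j - half
                    if c_ij and 0 <= y < rows and 0 <= x < cols:
                        acc += c_ij * e[y][x]
            f[n][m] = acc / weight_sum
    return f


if __name__ == "__main__":
    flat = [[1.0] * 9 for _ in range(9)]
    print(focus_map(flat)[4][4])
```

Dividing by the sum of the weights keeps f in the same numerical range as e, which matters later when the threshold is normalised against 255.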

[0035] Step 13: Having obtained a map containing, for each picture element, a measure f of the focus in the region thereof, it is necessary to make a decision as to which parts of the image are considered to be in focus and which are not. This can be done simply by comparing the measures with a threshold value so that only those elements with a measure above this value are considered to be in focus. The difficulty however lies in deciding what this threshold value should be. A fixed value is likely to be useful only if the images under consideration are fairly consistent in content. Possible methods of choosing the threshold would be on the basis of:

[0036] (a) the total area classified as in focus;

[0037] (b) the number of discrete contiguous regions classified as in focus;

[0038] (c) an estimation of the degree to which the classified region is in focus.

[0039] FIG. 4 shows a typical graph of deemed in-focus area plotted against the threshold value, based on approach (a). All examples of such functions are stepped and monotonically decreasing. When the threshold is set at zero the area is that of the image, whilst when the threshold is at maximum the area is zero. The shape of the curve will depend on the distribution of the focus in the image. If the focused subject is a small area against a large uniformly out-of-focus background, the function will decrease very quickly as the background is removed at a low threshold. If there is a distribution of focus spread equally across the spectrum and image, the area will decrease with a constant gradient. If the majority of the image is in focus and only a small area is not, the function will decrease slowly initially and decrease more rapidly at a higher threshold. The graph shown is characteristic of such an image.

[0040] In practice it is found that a “good” threshold is often found at the “neck” of the area curve, that is, the point on the curve which lies midway along the perimeter of the curve.

[0041] A practical solution based on approach (a) is to choose a threshold value such that, if the area and threshold T are both normalised (i.e. $a = A/A_{tot}$, where $A_{tot}$ is the total area (MN) of the image, and $t = T/255$), then $t = a(t)$; that is, the normalised threshold equals the normalised in-focus area which it produces.
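The condition t = a(t) can be sketched as a search over the 256 possible threshold levels. Because a(t) is stepped, exact equality may never occur, so this sketch takes the first level at which the normalised threshold reaches or passes the normalised area; that tie-breaking rule is an assumption. The area is normalised by the number of elements in the focus map rather than MN, which is also an assumption, and the function names are hypothetical.

```python
def area_fraction(f, threshold):
    """Fraction of focus-map elements whose measure is at or above threshold."""
    values = [v for row in f for v in row]
    return sum(1 for v in values if v >= threshold) / len(values)


def choose_threshold(f, levels=256):
    """Smallest integer T such that T / (levels - 1) >= a(T).

    a(T) decreases while T / (levels - 1) increases, so the two curves cross
    exactly once; the first level at or past the crossing is returned.
    """
    for threshold in range(levels):
        if threshold / (levels - 1) >= area_fraction(f, threshold):
            return threshold
    return levels - 1


if __name__ == "__main__":
    # Three quarters of the map is weakly focused, one quarter strongly.
    f = [[10, 10, 10, 200]]
    print(choose_threshold(f))
```

In the small example the chosen level falls between the weak and strong values, so only the strongly focused quarter survives the comparison f ≥ T.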

[0042] Step 14: Having selected the threshold, the result r(m,n) for any picture element is

[0043] r=1 for f≥t

[0044] r=0 for f<t

[0045] The image of FIG. 3a, with all parts for which r=0 set to zero (black), is shown in FIG. 3d.
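Step 14 can be sketched as a mask applied to the luminance image. Since the focus map has one fewer row and column than the image, the sketch treats the last image row and column as out of focus; that handling is an assumption, as is the function name `apply_mask`. Both arrays are lists of rows, with `p[n][m]` corresponding to p(m,n).

```python
def apply_mask(p, f, threshold):
    """Return a copy of image p with every pixel for which r = 0 set to black.

    r = 1 where f >= threshold and r = 0 elsewhere; pixels that have no
    focus-map entry (last row and column) are treated as r = 0.
    """
    out = []
    for n, row in enumerate(p):
        out_row = []
        for m, value in enumerate(row):
            has_measure = n < len(f) and m < len(f[0])
            in_focus = has_measure and f[n][m] >= threshold
            out_row.append(value if in_focus else 0)
        out.append(out_row)
    return out


if __name__ == "__main__":
    image = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
    focus = [[5, 1], [1, 5]]
    for row in apply_mask(image, focus, 3):
        print(row)
```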

[0046] Where this process identifies more than one discrete region of the picture as being in focus, a focus score for each such region can if desired be generated which is the sum of the focal activity for that region (i.e. Σf(m,n) as defined above over that region, or more preferably the product of this and the area of the region). This could be used by a subsequent recognition process to weight the relative importance of the different regions.
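The per-region focus score can be sketched with a flood fill over the thresholded focus map. The description does not say how contiguity is defined, so 4-connectivity is assumed here; the function name `region_scores` and the dictionary keys are hypothetical.

```python
from collections import deque


def region_scores(f, threshold):
    """Score each 4-connected region of elements with f >= threshold.

    Returns one dict per region, in raster-scan order of first encounter,
    holding its area, the sum of f over it, and the product of the two.
    """
    rows, cols = len(f), len(f[0])
    seen = [[False] * cols for _ in range(rows)]
    scores = []
    for n0 in range(rows):
        for m0 in range(cols):
            if seen[n0][m0] or f[n0][m0] < threshold:
                continue
            seen[n0][m0] = True
            queue = deque([(n0, m0)])
            area, total = 0, 0.0
            while queue:
                n, m = queue.popleft()
                area += 1
                total += f[n][m]
                for y, x in ((n + 1, m), (n - 1, m), (n, m + 1), (n, m - 1)):
                    if (0 <= y < rows and 0 <= x < cols
                            and not seen[y][x] and f[y][x] >= threshold):
                        seen[y][x] = True
                        queue.append((y, x))
            scores.append({"area": area, "sum_f": total, "weighted": total * area})
    return scores


if __name__ == "__main__":
    f = [[9, 9, 0, 0],
         [0, 0, 0, 5],
         [0, 0, 0, 5]]
    for score in region_scores(f, 5):
        print(score)
```

A subsequent recogniser could, for instance, rank regions by the `weighted` value, which favours regions that are both large and sharply focused.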

[0047] Step 15: The desired recognition process may now be performed, it being understood that picture elements for which r=0 are excluded from the recognition process or, alternatively, that the picture elements for which r=0 are set to black before the recognition process is performed. The actual recognition process will not be described here, since many alternatives are available and, apart from the above-mentioned limitation to picture elements for which r=1, it is performed in the conventional manner. Some recognition techniques are discussed in:

[0048] Flickner, M et al (1995) “Query by image and video content: the QBIC system” IEEE Computer, 28(9), 23-32.

[0049] Treisman A, “Features and objects in visual processing”; New Scientist, November 1986, page 106.

[0050] McCafferty, J D “Human and Machine Vision: Computing Perceptual Organisation” West Sussex, England: Ellis Horwood 1990.

[0051] Mohan, R and Nevatia, R “Using perceptual organisation to extract 3D structures”, IEEE Trans. Pattern. Anal. Machine Intell. Vol. 11 pp. 355-395, 1987.

[0052] Before this method can be implemented, a decision has to be made as to the size of the circular kernel C to be used to combine the edges into contiguous areas.

[0053] Whilst the invention should not be construed by reference to any particular theory, it is thought that a constructive approach to this question is to consider the perception of a human observer when viewing the image.

[0054] The retina lies at the back of the eye and is covered with light-sensitive cells, which send an electrical representation of reality to the brain, along the optic nerve. There are two different classes of cells on the retina, rods and cones. Rods are very sensitive to illumination and they are responsible for our perception of movement, shape, light and dark; however they are unable to discriminate colours. The majority of the 125 million rods are to be found in the periphery of the retina. Our ability to see colour and details is a function of the cones. The majority of the eye's cones are packed densely into the fovea centralis, where there are 150,000 per square millimetre. It is therefore this part of the retina where our vision is most acute. For this reason we move our eyes to project the region of interest onto the fovea; these jumps are known as saccades. Given that the angle visible to the fovea subtends approximately 2 degrees, it is possible to predict the number of picture elements most acutely visible to the viewer, knowing their viewing distance and screen resolution. This is a naturally inspired definition of neighbourhood, which is used to determine the kernel size.

[0055] Thus it is suggested that a guideline for the kernel size should correspond to an area equivalent to a viewing angle of 2 degrees at a given picture size and viewing distance. If the picture size and viewing distance are known, this may be calculated, or a reasonable assumption may be made.

[0056] For example, suppose that we want an area corresponding to a 2 degree field of view on a 17 or 21 inch monitor with a 3:4 aspect ratio and a viewing distance of 0.5 metres. A 2 degree field at 500 mm gives a circle on the screen of radius 500 sin(1°) = 8.7 mm, i.e. an area of 239 sq. mm. A typical 17″ monitor has a screen area of 240×320 mm, so that would represent a kernel of area 239/(240×320) = 0.0031, or 1/320th, of the screen area. Assuming that a 21″ screen has linear dimensions 21/17 times this, then we are looking at 1/490th of the screen area. This suggests a kernel area of typically 1/300 or 1/500 of the image area.
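The arithmetic of this example can be reproduced with a short sketch. The function names and parameters are hypothetical; the 2 degree angle, 500 mm viewing distance and 240×320 mm screen are the figures quoted in the text. The second function converts an area fraction into a kernel diameter in picture elements for a given image size, assuming a circular kernel.

```python
import math


def kernel_area_fraction(view_mm, width_mm, height_mm, angle_deg=2.0):
    """Fraction of the screen area covered by a circle subtending angle_deg."""
    radius_mm = view_mm * math.sin(math.radians(angle_deg / 2))
    return math.pi * radius_mm ** 2 / (width_mm * height_mm)


def kernel_diameter_px(fraction, width_px, height_px):
    """Diameter in pixels of a circular kernel covering the given area fraction."""
    area_px = fraction * width_px * height_px
    return 2 * math.sqrt(area_px / math.pi)


if __name__ == "__main__":
    f17 = kernel_area_fraction(500, 320, 240)
    scale = 21 / 17  # assumed linear scaling from a 17 inch to a 21 inch screen
    f21 = kernel_area_fraction(500, 320 * scale, 240 * scale)
    print(f"17 inch: 1/{1 / f17:.0f} of the screen area")
    print(f"21 inch: 1/{1 / f21:.0f} of the screen area")
```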

[0057] The kernel size used in the example of FIG. 3 was 14 pixels in diameter.

[0058] Some variations on the above theme will now be discussed. The edge detection process of step 10 has the merit of being a relatively simple process computationally, and has produced good results in practice; however a more sophisticated filtering process could be used if desired, or blocks of picture elements might be analysed by means of a transform such as the discrete cosine transform, with filtering performed by retaining only the higher order coefficients. This might be particularly attractive when processing images that are already digitally coded by means of a transform-based coding algorithm. Similarly, circular kernels other than the ‘flat’ example given above might be of use.
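The transform-based variation can be sketched as follows. The description gives no detail, so everything specific here is an assumption: an orthonormal two-dimensional DCT-II applied to a square block, with the high-frequency measure taken as the summed squared coefficients whose indices satisfy u + v ≥ cutoff. The function names and the default cutoff of 4 are hypothetical.

```python
import math


def dct2(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows."""
    size = len(block)

    def alpha(k):
        return math.sqrt(1 / size) if k == 0 else math.sqrt(2 / size)

    coeffs = [[0.0] * size for _ in range(size)]
    for u in range(size):
        for v in range(size):
            acc = 0.0
            for y in range(size):
                for x in range(size):
                    acc += (block[y][x]
                            * math.cos((2 * y + 1) * u * math.pi / (2 * size))
                            * math.cos((2 * x + 1) * v * math.pi / (2 * size)))
            coeffs[u][v] = alpha(u) * alpha(v) * acc
    return coeffs


def high_frequency_energy(block, cutoff=4):
    """Sum of squared DCT coefficients with u + v >= cutoff (assumed measure)."""
    coeffs = dct2(block)
    size = len(block)
    return sum(coeffs[u][v] ** 2
               for u in range(size) for v in range(size) if u + v >= cutoff)


if __name__ == "__main__":
    flat = [[100] * 8 for _ in range(8)]
    checker = [[255 * ((x + y) % 2) for x in range(8)] for y in range(8)]
    print(high_frequency_energy(flat), high_frequency_energy(checker))
```

A featureless block yields essentially no high-order energy, while a finely detailed block yields a large value, so the measure could stand in for the edge map of steps 10 and 11 on a block-by-block basis.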

[0059] In the thresholding process, the procedure may be modified to limit the resulting number of discrete in-focus regions rather than the total area of such regions (option (b) mentioned above), the rationale being that it is not generally useful to identify more than seven such areas.

[0060] Thus an alternative would be to calculate

[0061] (i) the threshold as described in step 13;

[0062] (ii) the highest threshold which gives fewer than eight in-focus regions;

[0063] and choose the lower of the two. That is, if the original threshold gives more than seven regions, it is reduced until it no longer does.
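The region-limiting variation can be sketched by following the operational reading of the last sentence: start from the step 13 threshold and lower it one level at a time until no more than seven regions remain. Lowering the threshold can merge neighbouring regions, which is how the count falls. The use of 4-connectivity, the unit step size and the function names are assumptions.

```python
def count_regions(f, threshold):
    """Number of 4-connected regions of elements with f >= threshold."""
    rows, cols = len(f), len(f[0])
    seen = set()
    count = 0
    for n0 in range(rows):
        for m0 in range(cols):
            if f[n0][m0] < threshold or (n0, m0) in seen:
                continue
            count += 1
            seen.add((n0, m0))
            stack = [(n0, m0)]
            while stack:
                n, m = stack.pop()
                for y, x in ((n + 1, m), (n - 1, m), (n, m + 1), (n, m - 1)):
                    if (0 <= y < rows and 0 <= x < cols
                            and (y, x) not in seen and f[y][x] >= threshold):
                        seen.add((y, x))
                        stack.append((y, x))
    return count


def limit_regions(f, start_threshold, max_regions=7):
    """Lower the threshold from start_threshold until at most max_regions remain."""
    threshold = start_threshold
    while threshold > 0 and count_regions(f, threshold) > max_regions:
        threshold -= 1
    return threshold


if __name__ == "__main__":
    f = [[9, 3, 9, 3, 9]]
    print(limit_regions(f, 5, max_regions=2))
```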

1. A method of image processing comprising (a) identifying one or more significant regions having a higher high spatial frequency content than does the remainder of the image; and (b) performing a subsequent recognition process only upon said identified regions.
2. A method according to claim 1 in which step (a) comprises: creating measures of high spatial frequency activity as a function of position within the image, comparing the measures with a threshold value, and signalling as significant regions those parts of the image having values of the measure which are indicative of greater high spatial frequency activity than the threshold value.
3. A method according to claim 2 in which the measures are created by applying a high-pass spatial filtering operation to the image in two different directions to produce first and second measures; combining the first and second measures to produce a third measure; and applying a low-pass spatial filtering operation to the third measure.
4. A method according to claim 2 or 3, in which the threshold value is chosen adaptively in dependence on the statistics of the measures over the whole image.
5. A method according to claim 2, 3 or 4 in which the threshold value is set, for the particular image, at a level such that the number of significant regions does not exceed a predetermined number.